Using Models for Data Analytics

Most of us are familiar with physical models such as model airplanes or model cars. What we often don’t realize is that when we are working with spreadsheets and analysis of data, we are also working with models. First let’s define the concept of a model, and then give some examples.

A model is a representation or abstraction of a real-world system or process. Because the real world is so complex, it is impossible to include every aspect of an actual system in a model. Hence, one important factor in developing or using a model is to make sure the correct elements are included in the model but also that unimportant factors are not included. Including unnecessary elements in a model inserts distracting complexity and causes confusion. Leaving out important factors yields a model that is incomplete and often unable to provide the desired understanding or prediction.

Types of Models

There are many different types and ways to format and present models. Let’s discuss four different types of models:

Textual Models. A textual model is a written description of some aspect of the real world. For example, the written description of an accident, or an accident report, can be considered a textual model of a real-world happening. When analysts are trying to describe some process or system, often a written description is the best way to present it. Textual models often can present details and explanations that cannot be easily understood by other models.

Physical Models. As mentioned earlier, we are familiar with physical models such as a model airplane. Another example that you may have seen is a physical model of a new building or shopping mall. These physical architectural models are very elaborate with houses, roads, cars, trees, and even miniature people. Figure 0.3 illustrates an example of an architectural model of a home.

Figure 0.3: Example of Architectural Model

siraanamwong © 123RF.COM

In this case, this model represents a portion of the potential real world, i.e., something that is potentially going to be built. Note that it only includes some aspects of the real world. In this case, it shows the outside physical walls and structure. Not included in the model are the interior design, the electrical wiring, the plumbing, the HVAC, and so forth. This is true of all models. They only represent a portion of the real world.

Graphical Models. A graphical depiction of a process or concept is sometimes the most powerful way to describe and explain a complex idea. The concept that a picture is worth a thousand words sometimes is very true when trying to model a complex process. One place we often use graphical models is to describe processes that include data and data relationships.

For example, assume we have two sets of data, which can either be in a database or in multiple workbooks in a spreadsheet. The two datasets are Number of Homes Purchased and Average Purchase Price. Analysts often use a graphical model such as Figure 0.4 to show how these two sets of data relate to each other.

Figure 0.4: Example of Graphical Model

Also note that this model only captures a few pieces of data about each entity. For Number of Homes Purchased other interesting attributes could include information about locations, types of homes (single family versus multiunit homes), time on market, new versus existing, and so forth.  Similar type of information may also be interesting about Average Selling Price, plus more information about pricing such as listing price versus final selling price, and whether the price had been adjusted.

One of the major values of modeling is that it permits the analyst to focus only on those characteristics that are of interest and important. Of course, that is one of the challenges of good modeling—to identify and include the characteristics that are important.

The various types of charts that can be produced by Excel are also graphical models. In Lesson 5 we review many different charts that represent real-world data of various events or processes. As you will note, each type of chart has strengths and weaknesses, and certain charts are better tools to help you visualize and understand specific conditions.

Mathematical Models. In many instances, the best way to describe a process or event is with a mathematical equation. In fact, much of the work we will do in this course using Excel is based on mathematical models. One of the most difficult challenges that analysts face is how to develop the correct equation, or model, to describe and understand a real-world process. There are two types of parameters in a mathematical equation, independent variables and dependent variables. The independent variables are the inputs to the equation, and the dependent variable is the result or answer variable.

In order to develop an effective model, the analyst must identify the correct set of independent variables that determine and predict the dependent variable accurately. For example, let’s write an equation for the expected monthly sales based on last year’s sales.

MarchSales = 1.10 * PrevMarchSales

This model anticipates that sales for this year will be 10% higher than last year’s sales. A very simple formula. But is it complete? What if there is a depression in the economy? What if we also have initiated an advertising campaign? What if the supply of our product is limited in some way? Maybe the equation should be more like this:

MarchSales = (1.10 * PrevMarchSales + .10 * Advertising) * EconomicGrowthFactor

In this case, we assume, or have determined based on previous analysis, that Advertising increases sales by 10% of the Advertising budget, with the entire amount modified by some Economic Growth Factor. We have ignored the supply issue. Is this a better equation? We will have to test it to see, and that is one of the analysis procedures we will learn in later chapters.

In this model, the independent variables are PrevMarchSales, Advertising, and EconomicGrowthFactor. These are the input values to the equation. The dependent variable, or output, is the MarchSales variable. Figure 0.5 shows the way it is used in an Excel spreadsheet.

Figure 0.5: Spreadsheet of Mathematical Sales Model

In the figure, the independent variable “Sales 2020” is an input, but of course is given by the actual data for the previous year. The independent variable “2021 Advertising” is an estimated number and can be controlled by what is actually expended. The independent variable “Economic Growth Factor” is an expected value that has been forecasted by economists. It could also change based on actual values. The dependent variable “Expected Sales 2021” is the column that contains the formula for the mathematical model that was defined.

Data-Driven Decision Models and Excel

One important question about models that we have not addressed yet is, “Why do we even build models?” In other words, what is the purpose or utility of model building? Considering the purpose of model building, we find two primary objectives or two categories of model objectives: (1) explanatory models and (2) decision models.

An explanatory model is useful to understand and explain a phenomenon. For example, weather models are useful for understanding and explaining the current weather conditions. They are also useful to anticipate or predict future weather conditions. However, a weather model does not give us the ability to make changes to modify the weather. It helps to predict, but not to make decisions that change the outcome. (The only decision is whether to change our personal plans for activities or not.)

As you go through this course, you will see many different examples of explanatory models. Many graphical models are used in presentations or reports to describe business conditions. Excel dashboards are also often used to monitor and track progress.

A decision model, on the other hand, is a model that includes input variables that affect the outcome and that can be used to make decisions. A decision model is used to produce a more quantitative analysis of a process in order to make more precise decisions. For example, doing a “what if” analysis would be considered a decision model. The observations of the various “what if” scenarios will enable a more accurate decision of the best outcome. Figure 0.5 could be considered a simple decision model because it can be used to help decide on advertising budgets to achieve a specific sales objective.

Data-driven modeling is the process used to make decisions for the organization based on actual data and not on intuition or logic alone. Even though in today’s interconnected electronic environment, massive amounts of data are collected, often referred to as “Big Data,” and require specialized database engines and software, Excel is also a very useful tool to develop data-driven decision models.

One of the most important aspects of using Excel to develop decision models is Excel’s strong data manipulation capabilities. Models built in Excel are “data-driven” models because they utilize spreadsheets of data to develop the model and perform the analysis. Excel spreadsheets are a very powerful tool to develop data-driven decision models because they are so flexible and powerful in manipulating data, analyzing various “what if” scenarios, and also in presenting results with graphical charts or mathematical equations.

Return on Investment Decision Model

Suppose a software company has identified three potential investment options for new phone apps. There are many considerations that go into this type of decision, but there are two important financial considerations. The first consideration is how much will be required to develop the software. A second consideration is which software app will generate the most return on investment (ROI). ROI analysis is a very common technique to evaluate potential investments.

For simplicity, we will simply name these three projects A, B, and C. For each of the projects, the business analyst has estimated the development costs required and also estimated the expected revenue from each project in each of the first five years of sales. Because all of these numbers are only estimates, they have been rounded to the nearest ten thousand dollars. Figure 0.6 shows the Excel spreadsheet that documents these estimates.

Figure 0.6: Table of Revenue Projections

As can be seen, Column A identifies the years, while Row 1 labels the projects. Row 8 contains the estimate of the initial investment. Row 7 contains the five-year total revenue for each project. Finally, Row 9 documents the ROI for each project.

Based on this table, which investment would be the best? Project A generates the least total revenue and has the lowest ROI of 188%. However, Project B and Project C have conflicting results. Project B generates the most total income ($400,000), but Project C yields the best ROI (316%). Even though the decision may be a difficult one for the company, it should be evident that this decision model does provide important information to assist in the decision.

Figure 0.7: Chart of Revenue Projections

More insight can also be obtained if we provide a chart of the three different revenue streams. Figure 0.7 illustrates the chart for these revenue streams. Based on the ROI numbers and the trendline from the chart, it appears that long term—greater than five years—Project C has the most potential. One additional consideration is that Project C also costs less, which may provide an additional $25,000 to invest in other projects, and which could also increase the total return for a $150,000 investment. This simple example illustrates one benefit of using decision models with Excel.

Data-Driven Regression Models

In the previous discussion of mathematical models, we had an equation of the form

MarchSales = 1.10 * PreviousMarchSales

This type of equation is called a linear regression model with a single independent variable. The general form of this type of model is

y = m*x + b

In the MarchSales example, b = 0. A more generalized form of this mathematical model will contain multiple independent variables and will have a general form of

y = b + m1*x1 + m2*x2 + m3*x3 ...

In the real world, however, not all relationships are linear, so there are also mathematical equations that express non-linear relationships between independent variables and a dependent variable. These are a few of the non-linear equations supported by Excel.

  • y = mx2 + b

  • y = mlog(x) + b

  • y = mx + b

One of the challenges for organizations is how to determine which mathematical equation models the real world the best. From the real world, we have a set of data points, and we can record those data points in Excel. The task is to take those data points and develop the best mathematical model, which can then be used to make decisions and predict future outcomes. We call this data-driven modeling because we begin with data points and from those data points we develop a mathematical model.

Excel is an excellent tool for developing data-driven models. Current versions of Excel can handle up to a million rows of data and over sixteen thousand columns. Let’s observe a simple linear regression model using Excel. In this example, we will model Unit Sales of Smartphones as a function of Advertising Budget. We recognize that this is probably a non-linear relationship because more and more advertising will reach a point of diminishing returns. But for simplicity, let’s assume a linear relationship—at least over the range of the available data points. The form of the equation will be:

UnitSales = m*AdvertisingBudget + BaseUnits

The BaseUnits is how many units will be sold with no advertising.

Figure 0.8: Smartphone Sales as a Function of Advertising

In this example, there are 25 data points of Unit Sales as a function of Advertising Budget or expenditures. Figure 0.8 only shows some of the data points. It does contain a scatter chart of the 25 data points along with the best fit trendline. In the figure we have also included the regression equation for this trendline. The equation is:

y = 0.0408*x + 50085

where y is unit sales, x is AdvertisingBudget, and BaseUnits = 50085.

This example illustrates how Excel can develop a mathematical model based on existing data points. More sophisticated models can also be developed with multiple variables or with non-linear equations. Of course, multiple variable models cannot be illustrated in a two-dimensional chart, but can still be developed with Excel. This example is presented only to teach the concept of data-driven modeling. In this course, you will learn the details and develop the skills to develop many different types of models using the tools provided in Excel.